So one of the things that I would like to briefly get into is what we call passive learning.
So we make things simple: we go for a fully observable environment with a partially observable reward function, and we want to find optimal policies.
It's basically just an MDP with a wrinkle, and what we have to do, and that's sufficient, is the same as in the policy evaluation subtask of policy iteration for MDPs.
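As a quick sketch of what that policy evaluation subtask computes, assuming the MDP is given as plain dictionaries; the names and data layout here are illustrative, not the lecture's notation, and T[s][a] is taken to be a list of (next_state, probability) pairs:

```python
def evaluate_policy(states, R, T, pi, gamma=1.0, iterations=100):
    """Iterate U(s) <- R(s) + gamma * sum_{s'} P(s' | s, pi(s)) * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U_new = {}
        for s in states:
            if s not in pi:        # terminal state: its utility is just its reward
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * sum(p * U[t] for t, p in T[s][pi[s]])
        U = U_new
    return U
```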
We want to find out what is the best policy here.
We can do that with the example we also had in policy iteration.
Say we have this kind of environment where we had experimented with the rewards; you remember that we had policies here given by the arrows, namely what you should do when you get into that field, and you have these two plus-one and minus-one rewards.
In one of these examples we had a minus 0.04 disincentive for loitering around.
That gives rise to this optimal policy: you don't quite want to run away here, and you prefer the plus-one exit.
This policy, remember, if you have the full reward function, actually gives rise to the utilities of being in a certain place.
How do you find them out?
Well, you just imagine yourself in a certain state, and then you run according to the policy.
In this case, if you're in (1,1), you would run up and then over and land in the plus-one exit, and if you add up the rewards, that is plus one minus a couple of times 0.04, then you end up with these values.
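For illustration, here is a small sketch of that "imagine yourself in a state and follow the policy" computation for the grid world. The coordinates, the policy arrows, and the -0.04 step reward follow the standard textbook example; the walk is deterministic here, so it ignores the action noise that the actual utilities account for, and the names (POLICY, rollout_utility) are just for this sketch.

```python
# Coordinates are (column, row); only the policy entries on the path from (1,1)
# to the +1 exit at (4,3) are listed.  Illustrative, deterministic walk-through.
POLICY = {(1, 1): "U", (1, 2): "U", (1, 3): "R", (2, 3): "R", (3, 3): "R"}
MOVE = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
EXITS = {(4, 3): +1.0, (4, 2): -1.0}
STEP_REWARD = -0.04

def rollout_utility(state):
    """Follow the policy from `state`, summing rewards until an exit is reached."""
    total = 0.0
    while state not in EXITS:
        total += STEP_REWARD                   # the small per-step penalty
        dx, dy = MOVE[POLICY[state]]
        state = (state[0] + dx, state[1] + dy)
    return total + EXITS[state]                # add the exit reward

print(rollout_utility((1, 1)))   # 1 - 5 * 0.04 = 0.8
```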
If we only have a partially observable reward function, say one where we can't really read off these little reward values, then you can still do something.
What you would do is you would run what we call trials.
As an agent you can actually put yourself into the situation, play a game of chess, live a life, study AI or whatever, and then at some point you're going to get the reward; in this case, say, when you hit an exit.
What you would do is make trials, where you start in state (1,1) and go until you get into a reinforcement state, one where you actually get a reinforcement, or for that matter just go until the end.
Then you sense the rewards, and then you reason backwards from that.
That's the idea.
You have a couple of trials; I've written down a couple of those, a couple of paths through this policy.
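A trial generator for this setup could be sketched as follows. The 0.8/0.1/0.1 action noise, the bouncing off walls, and the full policy table are the usual textbook assumptions for this grid world, not something stated above, and the function names are illustrative.

```python
import random

# The fixed policy for the 4x3 grid world (obstacle at (2,2), exits at (4,3)
# and (4,2)); 80% chance of moving as intended, 10% for each perpendicular
# direction, and bumping into a wall leaves the agent where it is.
POLICY = {(1, 1): "U", (1, 2): "U", (1, 3): "R", (2, 3): "R", (3, 3): "R",
          (2, 1): "L", (3, 1): "L", (4, 1): "L", (3, 2): "U"}
MOVE = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
PERP = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}
EXITS = {(4, 3): +1.0, (4, 2): -1.0}
OBSTACLE, STEP_REWARD = (2, 2), -0.04

def step(state, action):
    """Sample a successor state: 80% intended direction, 10% each to the sides."""
    a = random.choices([action] + list(PERP[action]), weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVE[a]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == OBSTACLE or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state                           # bumped into a wall: stay put
    return nxt

def run_trial(start=(1, 1)):
    """One trial: the sequence of (state, reward) pairs until an exit is hit."""
    state, trial = start, []
    while state not in EXITS:
        trial.append((state, STEP_REWARD))
        state = step(state, POLICY[state])
    trial.append((state, EXITS[state]))
    return trial
```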
Then you can just define what the utility is: at some point you see the reward, and that gives you a utility, which you would think of as a sample of the utility function.
Just like we had sampling in Monte Carlo search, where you basically ran all the way down, which means you simulate a game, see whether you win or lose, and then use that as a sample for the utilities upstairs.
You can define a utility which you can sample and that you can use for learning.
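A sketch of that sampling idea, reusing the run_trial generator sketched above: each visit to a state contributes the reward observed from there to the end of the trial as one sample, and the samples are simply averaged. The function name and the number of trials are just illustrative choices.

```python
from collections import defaultdict

def direct_utility_estimate(num_trials=1000):
    """Average the observed reward-to-go per state over many trials."""
    samples = defaultdict(list)
    for _ in range(num_trials):
        trial = run_trial()                       # from the previous sketch
        rewards = [r for _, r in trial]
        for i, (state, _) in enumerate(trial):
            samples[state].append(sum(rewards[i:]))  # reward-to-go from this visit
    return {s: sum(v) / len(v) for s, v in samples.items()}

U = direct_utility_estimate()
print(round(U[(1, 1)], 3))   # approaches the true utility of (1,1) as trials grow
```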
What we've essentially done is define a delayed reward which we can sample, and if we have enough samples, that gives us enough utilities so that we can essentially do value iteration and use MDP techniques to do unsupervised learning, or reinforcement learning.
That's something we've done quite a lot: you basically take another technique, lift it up one level, and use the old algorithm on some kind of induced learning space.
That's something you can do and that's called direct utility estimation and the algorithms